The idea in this notebook is to reduce the dimensionality of the datasets by transforming individual features using classifiers. Once we've done this it will be possible to combine the subject specific datasets into a single global dataset. This might run the risk of overfitting, but it is also a nice way to create a global classifier.

Loading the data and initialisation

Same initialisation steps as in other notebooks:


In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = 6, 4.5
plt.rcParams['axes.grid'] = True
plt.gray()


<matplotlib.figure.Figure at 0x7f3e64700470>

In [2]:
cd ..


/home/gavin/repositories/hail-seizure

In [3]:
import train
import json
import imp

In [4]:
settings = json.load(open('SETTINGS.json', 'r'))

In [6]:
data = train.get_data(settings['FEATURES'][:3])

In [7]:
!free -m


             total       used       free     shared    buffers     cached
Mem:         11933      11220        712        384        355       3429
-/+ buffers/cache:       7435       4497
Swap:        12287         34      12253

Random forest supervised classification

For each feature and each subject we want to train a random forest and use it to transform the data. We also want to appropriately weight the samples due to the unbalanced classes.

Since I'm a big fan of dictionaries it seems like it would be easy to do this with a dictionary iterating over subjects and features and saving predictions.


In [8]:
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble
import sklearn.cross_validation
from train import utils

In [9]:
imp.reload(utils)


Out[9]:
<module 'python.utils' from '/home/gavin/repositories/hail-seizure/python/utils.py'>

Below code copied and modified from random forest submission notes:


In [12]:
features = settings['FEATURES'][:3]

In [11]:
subjects = settings['SUBJECTS']

In [13]:
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('scl',scaler),('clf',forest)])

In [36]:
%%time
predictiondict = {}
for feature in features:
    print("Processing {0}".format(feature))
    for subj in subjects:
        # training step
        X,y,cv,segments = utils.build_training(subj,[feature],data)
        for i, (train, test) in enumerate(cv):
            weight = len(y[train])/sum(y[train])
            weights = np.array([weight if i == 1 else 1 for i in y[train]])
            model.fit(X[train],y[train],clf__sample_weight=weights)
            predictions = model.predict_proba(X)
            for name,split in [('train',train),('test',test)]:
                for segment,prediction in zip(segments[split],predictions[split]):
                    try:
                        predictiondict[segment][feature] = {}
                        predictiondict[segment][feature][i] = (name,prediction)
                    except:
                        predictiondict[segment] = {}
                        predictiondict[segment][feature] = {}
                        predictiondict[segment][feature][i] = (name,prediction)


Processing ica_feat_var_
Processing ica_feat_cov_
Processing ica_feat_corrcoef_
CPU times: user 3 s, sys: 10 ms, total: 3.01 s
Wall time: 3.01 s

Next, creating the full training set for a single train/test iteration:


In [37]:
segments = list(predictiondict.keys())

In [38]:
predictiondict[segments[0]].keys()


Out[38]:
dict_keys(['ica_feat_var_', 'ica_feat_cov_', 'ica_feat_corrcoef_'])

In [40]:
X = np.array([])[np.newaxis]
train,test = [],[]
for i,segment in enumerate(segments):
    row = []
    for feature in features:
        cv = list(predictiondict[segment][feature].keys())
        row.append(predictiondict[segment][feature][cv[0]][-1][-1])
        name = predictiondict[segment][feature][cv[0]][0]
        if name == 'train':
            train.append(i)
        elif name == 'test':
            test.append(i)
        else:
            print("segment {0} does not have a valid label: {1}".format(i,name))
    try:
        X = np.vstack([X,np.array(row)[np.newaxis]])
    except:
        X = np.array(row)[np.newaxis]

In [41]:
X


Out[41]:
array([[ 0. ,  0.1,  0. ],
       [ 0. ,  0. ,  0. ],
       [ 0.1,  0. ,  0. ],
       ..., 
       [ 0. ,  0.1,  0.1],
       [ 0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ]])

In [42]:
y = [1 if 'preictal' in segment else 0 for segment in segments]

In [43]:
y = np.array(y)

In [44]:
len(y)


Out[44]:
4067

In [45]:
len(X)


Out[45]:
4067

In [46]:
len(segments)


Out[46]:
4067

In [48]:
model.fit(X[train],y[train])
predictions = model.predict_proba(X[test])
score = sklearn.metrics.roc_auc_score(y[test],predictions[:,1])
print(score)


0.784804226469

In [52]:
sum(predictions)


Out[52]:
array([ 9684.86943723,   479.13056277])

Saving this operation as a script

We will probably want to be able to do this again, so we should save this operation as a function in utils.py. Will do this once it works.